
feat(adapter/nemo): cleanup checkpoint container on_train_end#94

Open
Leahlijuan wants to merge 9 commits intomainfrom
feat/cleanup

Conversation

@Leahlijuan (Collaborator)

This change adds `on_train_end` to the `MLFlashpointCheckpointCallback`. It shuts down the replication manager once the ML-Flashpoint checkpoint save is done, and then removes the MLF checkpoint container.
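For context, a minimal sketch of what such a hook could look like. This is illustrative only, not the actual adapter code: the `container_dir` attribute and the replication manager's `shutdown()` method are assumptions based on this description.

```python
import shutil
from pathlib import Path


class MLFlashpointCheckpointCallback:
    """Illustrative sketch only. Assumes the callback holds a
    replication_manager with a shutdown() method and knows the path of
    its checkpoint container; attribute names are hypothetical."""

    def __init__(self, container_dir, replication_manager=None):
        self.container_dir = Path(container_dir)
        self.replication_manager = replication_manager

    def on_train_end(self, trainer=None, pl_module=None):
        # Shut down replication first so nothing writes into the
        # container while (or after) we delete it.
        if self.replication_manager is not None:
            self.replication_manager.shutdown()
        # Then remove the MLF checkpoint container from local storage.
        if self.container_dir.is_dir():
            shutil.rmtree(self.container_dir, ignore_errors=True)
```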

@Leahlijuan Leahlijuan requested review from g-husam and kkkapu March 30, 2026 21:14
return

for cb in mlf_callbacks:
    cb.replication_manager = replication_manager
Collaborator

Should we just initialize this earlier and pass it to the callback? Or do you think this is better? It might complicate the recipes a little, which isn't great, but it would be more straightforward.

Collaborator Author

Let's keep it for now so that we don't need to modify the recipes.

@g-husam g-husam changed the title Feat/cleanup: cleanup checkpoint container on_train_end feat(adapter/nemo): cleanup checkpoint container on_train_end Apr 2, 2026
@Leahlijuan Leahlijuan requested a review from g-husam April 6, 2026 17:18
@Leahlijuan Leahlijuan requested a review from g-husam April 7, 2026 13:48
@github-actions

github-actions bot commented Apr 7, 2026

Python Code Coverage Summary

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint | 100% | 100% |
| src.ml_flashpoint.adapter | 100% | 100% |
| src.ml_flashpoint.adapter.megatron | 97% | 95% |
| src.ml_flashpoint.adapter.nemo | 98% | 94% |
| src.ml_flashpoint.adapter.pytorch | 99% | 92% |
| src.ml_flashpoint.checkpoint_object_manager | 93% | 93% |
| src.ml_flashpoint.core | 95% | 92% |
| src.ml_flashpoint.replication | 81% | 81% |
| **Summary** | **95%** (2364 / 2493) | **92%** (559 / 610) |

Minimum allowed line rate is 90%

@github-actions

github-actions bot commented Apr 7, 2026

C++ Code Coverage Summary

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| src.ml_flashpoint.checkpoint_object_manager.buffer_object | 93% | 54% |
| src.ml_flashpoint.checkpoint_object_manager.object_manager | 69% | 33% |
| src.ml_flashpoint.replication.transfer_service | 79% | 40% |
| **Summary** | **81%** (924 / 1142) | **43%** (698 / 1638) |

Minimum allowed line rate is 80%

namespace fs = std::filesystem;

namespace {
// We use a fork/exec approach calling 'rm -rf' here instead of
Collaborator

Interesting. Is this the only way to get around that? Do we know why there was a segfault?

  if os.path.isdir(container_id):
      # Use shutil.rmtree for recursive deletion.
-     shutil.rmtree(container_id)
+     shutil.rmtree(container_id, onerror=_onerror)
Collaborator

A note: this `delete_container` function is synchronous and doesn't use the async C++ impl for deleting a dir, so we should be careful about when we use each. For consistency, we might want to make this call that one and allow it to block.

Collaborator Author

Since we may call `delete_container` before everything is fully finished (save + replication), the transfer service might modify the dir at the same time (on the receiver we save to a tmp file first and then rename it), so there could be a file-not-found error.
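Given that race, one way the synchronous path can tolerate files vanishing mid-delete is an `onerror` handler that swallows `FileNotFoundError` and re-raises everything else. This is a sketch; `_onerror` here is my guess at the handler referenced in the diff.

```python
import os
import shutil


def _onerror(func, path, exc_info):
    # shutil.rmtree calls this as (function, path, sys.exc_info()).
    # Tolerate entries that disappear while we are deleting (e.g. a
    # receiver's tmp file that was renamed away); re-raise anything else.
    exc = exc_info[1]
    if not isinstance(exc, FileNotFoundError):
        raise exc


def delete_container(container_id):
    # Synchronous recursive delete, as discussed above.
    if os.path.isdir(container_id):
        shutil.rmtree(container_id, onerror=_onerror)
```

Note that on Python 3.12+ `shutil.rmtree`'s `onerror` parameter is deprecated in favor of `onexc`, which receives the exception object directly instead of an `exc_info` tuple.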

int peer_port_;
size_t max_size_;
std::queue<int> available_connections_; // Guarded by mtx_.
Collaborator

Can you explain the rationale for using a queue here and an unordered set below? Specifically, why is FIFO order relevant for `available_connections_`?

Also, are these two collections mutually exclusive?

Collaborator Author

Yes, they should be mutually exclusive. As for the queue in `available_connections_`: ideally there is no difference between a queue and a stack here; it's just a way to track the idle connections and grab one quickly.

I added `active_connections_` because we want to destroy all live connections during shutdown.
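The invariant described here (each connection lives in exactly one of the two collections) can be sketched as follows. This is a simplified Python analogue of the C++ pool, with made-up method names, not the actual implementation.

```python
import threading
from collections import deque


class ConnectionPool:
    """Illustrative sketch: a connection is either in `available`
    (idle, FIFO) or in `active` (in use), never both."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.mtx = threading.Lock()          # guards both collections
        self.available = deque()             # idle connections, FIFO
        self.active = set()                  # connections currently in use

    def acquire(self, make_conn):
        with self.mtx:
            # Reuse an idle connection if one exists, else create one.
            fd = self.available.popleft() if self.available else make_conn()
            self.active.add(fd)
            return fd

    def release(self, fd, reuse=True):
        with self.mtx:
            self.active.discard(fd)
            # Return to the idle pool only while there is room.
            if reuse and len(self.available) < self.max_size:
                self.available.append(fd)

    def shutdown(self, close):
        # Destroy every live connection, active or idle.
        with self.mtx:
            for fd in list(self.active) + list(self.available):
                close(fd)
            self.active.clear()
            self.available.clear()
```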

}
if (reuse) {
if (available_connections_.size() < max_size_) {
LOG(INFO) << "ConnectionPool::ReleaseConnection: reuse connection";
Collaborator

These INFO logs here and below are debug logs; I would avoid using INFO for them so we don't pollute the logs. Maybe use VLOG(3).

We can follow this guideline:

| Level | Usage Case | Frequency |
| --- | --- | --- |
| VLOG(1) | High-level flow: major state changes or important function entries (e.g., "Pool initialized", "Connection created"). | Low |
| VLOG(2) | Detailed events: per-request or per-connection logic (e.g., "Connection X added to active set"). | Medium |
| VLOG(3) | Deep tracing: logic branches inside loops or complex conditionals. | High |
| VLOG(4)+ | Extremely noisy: byte-level data, heartbeats, or internal mutex locking/unlocking details. | Very High |
